pH trends and seasonal cycle in the coastal Balearic Sea reconstructed through machine learning

The decreasing seawater pH trend associated with increasing atmospheric carbon dioxide levels is an issue of concern due to possible negative consequences for marine organisms, especially calcifiers. Globally, coastal areas represent important transitional land-ocean zones with complex interactions between biological, physical and chemical processes. Here, we evaluated the pH variability at two sites in the coastal area of the Balearic Sea (Western Mediterranean). High resolution pH data along with temperature, salinity, and also dissolved oxygen were obtained with autonomous sensors from 2018 to 2021 in order to determine the temporal pH variability and the principal drivers involved. By using environmental datasets of temperature, salinity and dissolved oxygen, Recurrent Neural Networks were trained to predict pH and fill data gaps. Longer environmental time series (2012–2021) were used to obtain the pH trend using reconstructed data. The best predictions show a rate of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-\,0.0020\pm 0.00054$$\end{document}-0.0020±0.00054 pH units year\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^{-1}$$\end{document}-1, which is in good agreement with other observations of pH rates in coastal areas. The methodology presented here opens the possibility to obtain pH trends when only limited pH observations are available, if other variables are accessible. Potentially, this could be a way to reliably fill the unavoidable gaps present in time series data provided by sensors.

Atmospheric carbon dioxide (CO 2 ) emissions are exponentially increasing since the industrial revolution, principally due to fossil fuel use, industry and land-use change. Around a 46% of this CO 2 remains in the atmosphere while the rest is captured by natural compartments: the terrestrial biosphere and the ocean 1 . At present, the oceans have absorbed around an estimated 26% of the total anthropogenic CO 2 released from 2011 to 2020 1 . Once CO 2 dissolves in seawater, a sequence of chemical reactions occurs that derives in an increase of [H + ] ions, which results in a decrease in seawater pH. This process, a consequence of increasing atmospheric CO 2 , is termed Ocean Acidification (OA) 2 . In addition to the pH decrease, [H + ] ions react with carbonate ions [CO 2− 3 ] to form [HCO − 3 ], leading to a reduction of the [CO 2− 3 ] ion levels 3 . Low carbonate levels affect the saturation state of calcium carbonate minerals, increasing difficulties in shell-forming for calcifying marine organisms (e.g., plankton, mollusks, echinoderms and corals). Consequences of OA are an important threat to marine ecosystems visible in higher levels of the trophic chain, with complex and wide-ranging impacts on the physiology of different species and therefore on their metabolism 4,5 . These metabolic effects will have numerous consequences at an organism scale, in particular, they can cause a decrease in growth, locomotion, reproductive capacity and homeostasis if they are not capable to control the conditions for calcification 6 . Negative effects of this magnitude could cause an unexpected cascade effect impacting on the structure and functions of ecosystems and trophic networks 7 and cannot be easily generalized.
Also, ocean CO 2 uptake and derived OA are not homogeneous at the global scale, with some areas more affected. For instance, the Mediterranean Basin is an area where effects are stronger compared to the global ocean 8 . The Mediterranean Sea, constituting only a 0.82% of the surface and 0.32% of the volume of the global ocean, is cataloged as one of the most complex marine ecosystems, defined as a "miniature ocean" 9  www.nature.com/scientificreports/ study aims to provide a useful tool to fill gaps in pH time series and to reconstruct pH data when additional environmental variables are available.

Results
Time series data. The Fig. 1a-c).  www.nature.com/scientificreports/ In both stations, temperature ranged from a minimum of 12.99 • C to maximum values of 29.07 • C from 2012 to 2021, with no observed differences between the stations in Cabrera and the Bay of Palma (Fig. 1a). The surface water temperatures are a clear representation of the typical Mediterranean climate seasonality with mild winters and warm to hot summers. Salinity did not show a repetitive seasonal pattern between years in either stations. However, in Cabrera salinity is slightly lower than in the Bay of Palma. During the data acquisition period, the lowest salinity value of 36.83 was found in Cabrera and highest of 38.30 in the Bay of Palma (Fig. 1b).
The surface water of the coastal sites in the Balearic Sea in the Palma Bay and the Cabrera stations was highly saturated with oxygen during all the seasons, with DO concentrations up to 348.94 µmol kg −1 during winter and of 150.66 µmol kg −1 during the summer and early autumn (Fig. 1c). pH T values obtained starting in December, 2018 to December, 2021 increased during winter reaching up to 8.18 pH units at in situ temperature and decreasing to 7.91 pH units in summer, with the highest variability and maximum and minimum values measured in Cabrera (Fig. 1d).
Considering sampling period of the additional (temperature and salinity) and calculated parameters (Total Alkalinity; TA) was larger in the Bay of Palma compared to Cabrera, we evaluate the linear tendencies with time for the Bay of Palma variables. The sea surface temperature in the Bay of Palma increased with a rate of 0.035 ± 0.008 • C per year (R 2 = 0.008, p-value < 0.001) from 2012 to 2021, whereas the salinity decreased significantly with − 0.059 ± 0.002 psu per year (R 2 = 0.25 , p-value < 0.001). The annual trend for TA, clearly related to the decrease in surface salinity, showed a relevant decrease of − 4.0 ± 0.4 µmol kg −1 (R 2 = 0.0379 , p-value < 0.001), supported by the discrete water samples for TA obtained during the period from 2019 to 2021 (Fig. S3).
Reconstruction pH time series with deep learning. The amount of available pH T data from both Palma Bay and Cabrera stations is comparable and relatively short (mostly in Cabrera), but the length of the additional ambient data (temperature, salinity and DO) differs enormously among stations. Thus, there is a need to approach the time series prediction problem for both sites with different objectives. Common to both sites, a DL model with a RNN architecture will be developed to predict the pH T time series from the accompanying ambient data (temperature, salinity and dissolved oxygen), which are expected to be correlated with pH T 42,47 . To avoid the effect of site-specific correlations between ambient data and pH T time series, the model will be trained independently with the dataset of each location. In this way a proper model calibration is ensured and the prediction power of the model is enhanced. In the Bay of Palma, the model will be used to reconstruct the pH T time series from 2012, exclusively from the points for which the full set of ambient time series data are available (Fig. 1d). This is not possible in Cabrera, due to the fact that no temperature, salinity and DO concentration is available before 2019. Fortunately, these time series do not have the same gaps that the pH T time series exhibits. Thus, we will use the model to fill the gaps in the pH T time series from 2019 to present, as shown in (Fig. 1d).
A BiDireccional Long Short-Term Memory (BD-LSTM) neural network (Fig. S4) was selected as the best recurrent neural network architecture to reconstruct the pH T time series in the Bay of Palma. The training process was successfully completed with no signs of overfitting achieving less than 1% error in both training and validation sets (Fig. 2a). The BD-LSTM neural network was able to fairly predict the majority of the individual pH data points in the time series, although there are some deviations (Fig. 2b). Furthermore, the time series pattern is perfectly captured by the neural network (Fig. 2c). Notice the gaps in the reconstructed pH points (in red) in (Fig. 2d), that are those for which the full ambient time-series is not available. Finally, the reconstructed pH data using the BD-LSTM model was used to assess the decadal trend of acidification in Palma Bay, which yielded −0.0020 units of pH per year (black line in Fig. 2d). Indeed, to further characterize this decadal trend, 1000 independent training-prediction processes were carried out using a BD-LSTM neural network. The results showed a mean slope of −0.0020 ± 0.00054 for the decadal acidification trend (see "Methods").
Regarding the Cabrera data set, with the available ambient time series it is only possible to fill the data gaps, task for which a BD-LSTM neural network was also used. As for the Bay of Palma the training process was successfully completed with no signs of overfitting, yielding less than 1% error in both training and validation dataset (Fig. 3a). The model fairly predicts most of the individual pH T data points in the training dataset, showing some deviations as usual (Fig. 3b). The tendency of the time series is perfectly captured by the model (Fig. 3c) and thus the gap can be filled with reliable data, red points in (Fig. 3d).

Discussion
The achievement of long term oceanographic data series suitable to evaluate the effects of climate change constitutes a great operational effort which is unequivocally accompanied by partial data loss due to multiple factors (human and instrumental). The advances in the development of pH sensors are enabling the acquisition of precise pH data without identified drift through highly accurate indicator-based spectrophotometric methods 49 . However, in order to determine OA trends, several years of quality seawater pH data are needed, adding more difficulty to the vicissitudes inherent to field work. Recently, the application of computational methods based on Deep Learning (DL) is becoming a useful tool to fill the gaps due to data loss. Several studies have implemented the DL methodology and successfully predicted bio-optical and biogeochemical parameters 44,[46][47][48][50][51][52][53][54] .
Here, the application of a BiDireccional Long Short-Term Memory (BD-LSTM) neural network to predict pH T from physical data, namely temperature, salinity, and dissolved oxygen, the latter as a key indicator of biological activity, permitted the reconstruction of gaps in the time series of pH T and allowed the reconstruction of nine years of pH T data. The BD-LSTM architecture has been proved extremely effective in predicting sequence data, such as time series, as they combine the information for both front and back directions of time ( Fig. S4) 55 , and is more effective (accurate and stable) compared to unidirectional Long Short-Term Memory neural networks. Therefore, in this study the BD-LSTM offered better estimation results over the other neural networks considered to reconstruct time series but also in the completion of missing data. www.nature.com/scientificreports/ In the Cabrera station, the BD-LSTM permitted a reliable reconstruction of the gaps in pH T data from December 2019 to June 2020 (Fig. 1d), constituting an advantageous methodology to support the acquisition of long time series data without losing accuracy, as the model can reproduce pH T data with an error lower than 1% (Fig. 3b), closely following the annual variability of the observations (Fig. 3d).
The ability of the BD-LSTM to reconstruct time series was observed through the reconstruction of nine years of pH T data in the Bay of Palma station (Fig. 2d). The modeled pH T data combined with the observations allowed the accomplishment of a long pH time series in order to estimate a pH trend, seasonally adjusted through a sinusoidal fitting, with a rate of decrease of 0.0020 ± 0.00054 pH units per year ( R 2 = 0.1 , p-value < 0.001 , Fig. S1), and represents the first estimate of pH trend obtained in the Balearic coastal Sea. Additionally, we applied a linear fit on the reconstructed pH time series obtaining trend of −0.0025 ± 0.00053 year −1 ( R 2 = 0.01 , p-value < 0.001 ). This fit was discarded, because it was shown to introduce a bias in the pH decrease trend.
The observed pH decrease in the Balearic Sea coastal area is well aligned with OA trends reported for open ocean areas, from − 0.0013 pH units year −1 in the Munida station (New Zealand) to the high trend found in the Cariaco Basin station up to − 0.0026 pH units per year 27 . The processes associated with the increased pH decline in the Cariaco Basin where related to the upwelling of Subtropical Underwater, rich in dissolved inorganic carbon, thus lowering the pH.
In the Mediterranean Sea, previous annual estimates in open ocean areas ranged from − 0.003 to − 0.0044 19,22 , reflecting the effect of the hydrodynamical and biogeochemical characteristics of the basin on the seawater pH variability 18,56,57 . However, it can be assumed that differences in physical oceanography and ecological processes between areas may modulate local changes of pH. In a coastal Mediterranean area located in the northwestern basin, close to Villefranche-sur-Mer (France), a rate of pH change of −0.0028 ± 0.0003 pH units year −1 was observed 21 and attributed principally to atmospheric forcing and secondly to increased warming. The calculated trend of pH decrease due to the atmospheric CO 2 growth during the period of this study, from 2013 to 2021, was of 0.0025 ± 0.0002 pH units per year (R 2 = 0.95 , p-value < 0.001), consistently related to the seawater pH decline. Therefore, these analyses suggest that the atmospheric forcing is the main driver responsible for the pH decreasing trend found in the surface coastal Balearic Sea. Subsequently, the difference between the seawater pH decreasing trend obtained and the pH trend calculated from the atmospheric levels could be related to natural biogeochemical processes, not distinctly quantifiable with the available length of the Bay of Palma pH time series.
In addition, the effect of temperature on surface ocean pH, occurring directly through the temperature dependence of the seawater CO 2 chemistry, as changes in temperature and salinity influence the equilibrium www.nature.com/scientificreports/ constants of the oceanic CO 2 system and indirectly through air-sea exchange of CO 2 , can be considered. The influences of these two temperature processes on surface ocean pH has been found responsible of a 50% of the increase in [H + ] ions, thus a pH decrease, in the surface layers of the Iceland and Irminger Seas 58 . In the Mediterranean Sea northwestern basin, a temperature increase of 0.072 ± 0.022 • C year −1 was estimated to be responsible for a 40% of the pH decrease 21 . The obtained temperature variability in the Balearic Sea coastal area during this study was of 0.035 ± 0.008 ( R 2 = 0.008 , p-value < 0.001 , Fig. S2), indicating that temperature-driven changes could also be assumed to affect the pH trend. The observed seasonal variability of the data, presented a pH T increase from 7.91 during summer up to 8.18 pH units (Fig. 1d) in winter seasons, clearly followed by the TA values (Fig. S3). Seasonal changes in TA levels in the study area are ranging from around 2350 to 2550 µmol kg −1 (Fig. S3), largely overtaking the seasonal differences reported previously in the Balearic Sea of up to 50 µmol kg −1 in total 59 . This discrepancy in variability could be explained by the intense metabolic processes at the coastal location of the Bay of Palma station. This shallow area has a strong coverage of Posidonia oceanica, which due to its high ecosystem production 60 could be triggering an increase of pH and TA levels, as seen in salinity normalized TA values (NTA, not shown) during winter-spring, due to the uptake of nitrate and phosphate and the calcium carbonate dissolution 59,61 , and during summer, related to the lower community production 62 a NTA-pH decrease 59 .
Another result from this study worth to mentioning is the obtained decreasing TA trend in the Bay of Palma of − 4.0 ± 0.4 µmol kg −1 per year. Although the Western Mediterranean is characterized with lower total alkalinity values in relation to the rest of the basin, resulted from the nearby influence of Atlantic waters, less salty with low-alkalinity water 63,64 , was not expected to influence decreasing decadal TA values. In the northwestern basin, TA values increased over time at a rate of 2.08 ± 0.19 µmol kg −1 year −1 . In the Balearic Sea, the decreasing TA confirm the Atlantic forcing on the alkalinity values and the negligible TA discharges due to rivers in the Balearic Islands. There is a marked south-to-north surface gradient in the western region coupled with the west-to-east gradient of alkalinity in the Mediterranean Sea related to the Atlantic influence 59,65 . Due to a well-established linear relationship of TA and salinity 66 and the calculated origin of our values 65 we cannot neglect the strong TA related to the salinity decrease in the study area of − 0.059 ± 0.002 psu per year (R 2 = 0.25 , p-value < 0.001). This rate is in agreement with the salinity decrease found at the coastal site at Villefranche-sur-Mer (− 0.0017 ± 0.0044 psu year −1 ) 21 . Notwithstanding, the intense salinity decrease observed in the Bay of Palma can be linked to a decrease of the intensity of the southern spreading of the Balearic Current trough the Ibiza channel (located between Ibiza and Mallorca Islands) driven by mesoscale processes, and the prevalence of new Atlantic Water www.nature.com/scientificreports/ coming from the Strait of Gibraltar 67 . Although, this observation is out of the scope of this study and therefore further investigation is needed. In summary, this work pointed out the useful use of DL techniques, specifically the BD-LSTM architecture, to reconstruct pH data relevant to evaluate seasonal pH variability and to elucidate the climate change consequences, as the OA effect, in a coastal area of the Balearic Sea, which can be extended to the coastal areas of the Western Mediterranean Sea Basin. Nevertheless, future research is necessary to assess and confirm these regional trends, which highlights the importance of maintaining the time series monitoring networks whose data are the base of this study.

Methods
Study area. We monitored two coastal stations located in the Archipelago of the Balearic Islands in the Western Mediterranean Sea (Fig. 4). One site was positioned within the Bay of Palma (39 • 29.57088 ′ N; 2 • 42.02430 ′ E (Fig. 4b) at a fixed station consisting of an oceanographic buoy managed by the Balearic Islands Coastal Ocean Observing and Forecasting System (SOCIB; https:// www. socib. eu/). Here meteorological, hydrological and hydrodynamic data are collected with an hourly frequency since October 2012. The buoy is located at the surface over 20m bottom depth. The Bay of Palma is a large bay with a surface area of 217 km 2 and approximately 30% seagrass cover 68  Temperature and salinity from the Bay of Palma oceanographic buoy was obtained from October 2012 and for the Cabrera mooring line from November 2019 with a CT SBE37 (Sea-Bird Scientific©) in both stations. Accuracy of the CT is ± 0.002 • C for temperature and ± 0.003 mS cm −1 for conductivity. Additionally, oxygen data from a SBE 63 (Sea-Bird Scientific ©) sensor attached to the CT in Cabrera and from a YSI 6600V2-4 Multiparameter Water Quality Sonde with a 6450 ROX DO sensor (Yellow Spring Instruments Inc. ©) 71 and a miniDot (PME, Inc. ©) in the Bay of Palma were used. Accuracy of oxygen sensors is ± 2%, ± 1% and ± 5% for the SBE 63, the YSI and the miniDot, respectively.
Periodically water samplings for dissolved oxygen (DO), pH in total scale at 25 • C (pH T25 ) and total alkalinity (TA) were obtained during the sensor maintenance campaigns. DO and (pH T25 ) samples were collected in order to validate the data obtained by the sensors.
DO concentrations were evaluated with the Winkler method modified by Benson and Krause 72 by potentiometric titration with a Metrohm 808 Titrando with a accuracy of the method of ± 2.9 µmol kg −1 and with an obtained standard deviation from the sensors data and the water samples collected of ± 5.9 µmol kg −1 .
pH T25 data was obtained by the spectrophotometric method with a Shimadzu UV-2501 spectrophotometer containing a 25 • C-thermostated cells with unpurified m-cresol purple as indicator following the methodology  74 . TA values were also calculated from the temperature and salinity values obtained in the Bay of Palma from 2012 by using a second-order polynomial model for TA specifically described for the Mediterranean Basin 65 .
pH values due to the atmospheric CO 2 levels were estimated by using the CO2SYSv3 program 75 , with the most internally consistent and preferred carbon 76,77 and sulphate dissociation constants 78 for current surface ocean studies 79 , with the Bay of Palma in situ temperature and salinity, the calculated TA values and the atmospheric CO 2 levels converted from dry air to wet 80 as inputs. Carbon dioxide (CO 2 ) atmospheric molar fraction used was obtained from the monitoring station of Lampedusa (LMP), Italy of the NOAA (National Oceanic and Atmospheric Administration, USA) monitoring network 81 .
Data processing. Once data was validated, several processing steps were performed to ensure an optimal training process for the neural network models. First, all the data of the time series were re-sampled by averaging the data points obtaining a daily frequency. Afterwards, a standard feature-scaling procedure (min-max normalization) was applied to every feature (temperature, salinity and oxygen) and to pH T . Finally, we built our training and validations sets as tensors with dimensions (batch size , window size , N features ) , where batch size is the number of examples to train per iteration, window size is the number of past and future points considered and N features is the number of features used to predict the target series. Temperature values below T = 12.5 °C were discarded as they are considered outliers in sensor data outside the normal range in the study area.
Computing the trend of seasonal data. The trend of seasonal time-series is often computed by means of statistical methods based on moving averages or more advanced techniques such as the Seasonal Trend Decomposition Loess 82 . Nevertheless, these procedures do not work with gappy time series, so that a different approach is needed. In this work we fitted the following oscillatory function with trend to our data: where the parameter B corresponds to the trend of the data.
Moreover, after this fit, the seasonal component ( A sin(ωt + φ) ) can be removed from the original time-series and a standard linear regression can be performed to the transformed data to obtain the trend (which is exactly B) with the R 2 and p-value estimates given by the linear regression (Figs. S1, S2).
Selecting the best neural network architecture. Several recurrent neural network (RNN) architectures were considered as candidates to reconstruct the pH time series, including a Simple Recurrent neural network (SRNN), Long-Short Term Memory (LSTM), BiDirectional LSTM (BD-LSTM, (Fig. S4) and BiDirectional Gated Recurrent Unit (BD-GRU).
Initially, manual tests were performed on each architecture to determine the optimal set of parameters that yielded the best possible results. These tests were based on minimizing the errors in both training and validation set while avoiding overfitting. To avoid overfitting, we implemented automated callbacks to stop the training process whenever the validation loss increased or crossed the training loss. During this test we determined the minimum number of nodes, which helps in avoiding overfitting and the minimum window size, which allows to use the most possible number of data points for training and prediction. All the RNNs were trained in batches of size 32. To enhance clarity and accessibility, the optimal values obtained for the more relevant parameters are summarized in Table 1.
In order to identify the best-performing architecture an automated procedure was developed to statistically compare the outputs of each model. Each architecture was trained in 1000 independent processes, ensuring a final training mean-squared error of less than 0.8% while avoiding overfitting implementing the previously mentioned callbacks. The code used for the analysis can be found in 83 .
In Table 2, a summary of the statistical results obtained for each architecture is presented. All architectures provide similar training and validation errors and provide similar results for the decadal pH T trend, predicting a slope of around −0.0020 pH units per year with an intercept of 8.07 pH units. However, the BD-LSTM turns out to be the architecture providing most accurate (smallest training and validation errors) and precise (smallest (1) y(t) = A sin(ωt + φ) + Bt + C , www.nature.com/scientificreports/ statistical error) results (Fig. S4). Thus, we selected the BD-LSTM neural network as the best architecture to reconstruct the pH T time series. Data corresponding to the Bay of Palma were used in the selection of the best neural network architecture. The code and data used to determine the best neural network architecture can be found in a GitHub repository 83 .  www.nature.com/scientificreports/